knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lubridate)
library(viridis)
library(leaflet)
lat_long = read.csv("./data/lat-long.csv")
\(~\)
The motivation of this project was to use the data, the New York City (NYC) Department of Health and Mental Hygience (DOHMH) Dog Licensing Dataset.This dataset spans various boroughs of NYC and will help to deepen our understanding of the current patterns and trends among canines in NYC.
\(~\)
We sought to understand the current landscape of licensed dogs in New York City. Additionally, we were also interested in comparing these demographic factors among the five boroughs. Furthermore, we wanted to deepen our understanding on the current patterns of licensing between 2017 - 2019 and examine if there were variations by year or month of license. Finally, we sought to find ways of predicting a dog’s age based on the borough and the gender of the dog. Based on this information, our questions are:
\(~\)
We utilized the NYC Dog Licensing dataset. All dog owners who reside in NYC are required by law to license their dogs. This is a publicly available dataset sourced from the Department of Health and Mental Hygiene (DOHMH) Dog Licensing System where owners are able to apply for and renew dog licenses. This dataset was created on September 19, 2017 and was last updated on July 25, 2019. There are 10 variables in this dataset with a raw number of 345,727 observations. The variables provided in the raw dataset are as follows:
\(~\)
\(~\)
lat_lon = read.csv("./data/lat-long.csv")
dogz = read.csv("./data/NYC_Dog_Licensing_Dataset.csv") %>%
janitor::clean_names() %>%
filter(breed_name != "Unknown",
extract_year-animal_birth_month < 20) %>%
mutate(animal_name = str_to_sentence(animal_name),
borough = ifelse(zip_code %in% c(10001:10282),"Manhattan", borough),
borough = ifelse(zip_code %in% c(10301:10314), "Staten Island", borough),
borough = ifelse(zip_code %in% c(10451:10475), "Bronx", borough),
borough = ifelse(zip_code %in% c(11004,11101:11106,11109,11351,11354:11375,11377:11379,11385,11411:11423,11426:11430,11432:11436,11691:11694,11697), "Queens", borough),
borough = ifelse(zip_code %in% c(11201,11203:11226,11228:11239,11241:11243,11249,11252,11256), "Brooklyn", borough),
borough = as.factor(borough),
breed_name = as.character(breed_name),
license_issued_date = as.Date(license_issued_date, format = "%m/%d/%Y"),
license_expired_date = as.Date(license_expired_date, format = "%m/%d/%Y"),
license_length = year(license_expired_date) - year(license_issued_date),
dog_age = 2019 - animal_birth_month,
breed_name = ifelse(breed_name == "American Pit Bull Terrier/Pit Bull", "American Pit Bull Mix / Pit Bull Mix", breed_name),
breed_name = ifelse(breed_name == "American Pit Bull Terrier/Pit Bull", "American Pit Bull Mix / Pit Bull Mix", breed_name),
breed_name = ifelse(breed_name == "Australian Cattle Dog", "Australian Cattledog", breed_name),
breed_name = ifelse(breed_name == "Bassett Hound", "Basset Hound", breed_name),
breed_name = ifelse(breed_name == "Brittany", "Brittany Spaniel", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Black and Tan", "Black and Tan Coonhound", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Black and Tan", "Black and Tan Coonhound", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Blue Tick", "Bluetick Coonhound", breed_name),
breed_name = ifelse(breed_name == "Coonhound, Treeing Walker", "Treeing Walker Coonhound", breed_name),
breed_name = ifelse(breed_name == "Cotton de Tulear", "Coton de Tulear", breed_name),
breed_name = ifelse(breed_name == "Japanese Chin/Spaniel", "Japanese Chin", breed_name),
breed_name = ifelse(breed_name == "Jindo Dog, Korea", "Jindo", breed_name),
breed_name = ifelse(breed_name == "Mastiff, French (Dogue de Bordeaux)", "Dogue de Bordeaux", breed_name),
breed_name = ifelse(breed_name == "Pharoh hound", "Pharaoh Hound", breed_name),
breed_name = ifelse(breed_name == "Schipperke", "Schipperkee", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Miniature", "Miniature Schnauzer", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Miniature Crossbreed", "Miniature Schnauzer, Crossbreed", breed_name),
breed_name = ifelse(breed_name == "Welsh Corgi, Cardigan", "Cardigan Welsh Corgi", breed_name),
breed_name = ifelse(breed_name == "Welsh Corgi, Pembroke", "Pembroke Welsh Corgi", breed_name),
breed_name = ifelse(breed_name == "Bull Dog, French", "French Bulldog", breed_name),
breed_name = ifelse(breed_name == "Collie, Bearded", "Bearded Collie", breed_name),
breed_name = ifelse(breed_name == "Bullmastiff", "Mastiff, Bull", breed_name),
breed_name = ifelse(breed_name == "Neapolitan Mastiff", "Mastiff, Neapolitan", breed_name),
breed_name = ifelse(breed_name == "Tibetan Mastiff", "Mastiff, Tibetan", breed_name),
breed_name = ifelse(breed_name == "Pointer, German Shorthaired", "German Shorthaired Pointer", breed_name),
breed_name = ifelse(breed_name == "Pointer, German Wirehaired", "German Wirehaired Pointer", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Giant", "Giant Schnauzer", breed_name),
breed_name = ifelse(breed_name == "Schnauzer, Standard", "Standard Schnauzer", breed_name),
breed_name = ifelse(breed_name == "Chinese Shar-Pei", "Shar-Pei, Chinese", breed_name)) %>%
drop_na(borough) %>%
rename(animal_birth_year = animal_birth_month, license_id = row_number) %>%
select(-extract_year)
dogz_1 =
dogz %>%
distinct(animal_name, animal_gender, animal_birth_year, breed_name, .keep_all = TRUE) %>%
select (-license_issued_date, -license_expired_date, -license_length)
distinct_dogz =
left_join(dogz_1, lat_lon, by = "zip_code") %>%
select(-City,-State,-Timezone,-Daylight.savings.time.flag,-geopoint) %>%
janitor::clean_names()
The dataset was cleaned and adjusted for this analysis. All dogs whose breed were listed as “unknown” were removed, as well as any dogs who were listed as over 20 years old at the time of their license. Originally there were no boroughs listed in this dataset, so we used the available zip codes to assign boroughs to each record. Records whose zip code did not correlate to a borough were dropped. A variable license_length was added to calculate the length in years of each license. There were slight differences in the way breed names were recorded, these were adjusted to be uniform.
Variables were changed to appropriate data type where necessary.
The main database included the same dog multiple times if she had more than one license. We created a subset to catalog only the distinct dogs present in our dataset. To identify individual dogs we matched records on animal name, gender, birth year, and breed name. We also added latitude and longitude data that corresponds to zip code for mapping purposes.
\(~\)
plot_2=
distinct_dogz %>%
filter(animal_gender == "M" | animal_gender == "F") %>%
ggplot(aes(x = borough, fill = animal_gender)) +
geom_histogram(position = "dodge", stat = "count") +
labs(
title = "Gender of dogs by borough",
x = "New York Borough",
y = "Number of dogs"
)
plot_2

Based on this bar graph plotting borough by frequency of dogs, we can see that overall, Manhattan has the most dogs (over 20,000 dogs for each gender) followed by Brooklyn, Queens, The Bronx, and Staten Island (with about 5,000 dogs for each gender). Additionally, among all boroughs there are more male dogs than there are female dogs.
\(~\)
plot_3=
distinct_dogz %>%
filter(dog_age < 20) %>%
ggplot(aes(x = borough, y = dog_age)) +
geom_boxplot (color = "purple") +
labs(
title = "Age distribution of dogs by borough",
x = "New York Borough",
y = "Dog's Age"
)
plot_3

Based on this box and whisker plot, plotting borough dog’s age in years, we can see that overall, the median age for dogs is about 6 years. The Bronx, Brooklyen, and Queens have similar distributions which are right skewed while Manhattan has a smaller range of values. Staten Island has the oldest dogs ranging above 10 years with no outliers shown.
\(~\)
table2 = distinct_dogz %>%
group_by(dog_age) %>%
count(breed_name) %>%
top_n(3)
knitr::kable(table2)
| dog_age | breed_name | n |
|---|---|---|
| 1 | Labrador Retriever Crossbreed | 231 |
| 1 | Shih Tzu | 400 |
| 1 | Yorkshire Terrier | 569 |
| 2 | American Pit Bull Mix / Pit Bull Mix | 448 |
| 2 | Shih Tzu | 564 |
| 2 | Yorkshire Terrier | 813 |
| 3 | American Pit Bull Mix / Pit Bull Mix | 697 |
| 3 | Shih Tzu | 579 |
| 3 | Yorkshire Terrier | 754 |
| 4 | American Pit Bull Mix / Pit Bull Mix | 896 |
| 4 | Shih Tzu | 472 |
| 4 | Yorkshire Terrier | 548 |
| 5 | American Pit Bull Mix / Pit Bull Mix | 956 |
| 5 | Chihuahua | 529 |
| 5 | Shih Tzu | 507 |
| 6 | American Pit Bull Mix / Pit Bull Mix | 778 |
| 6 | Chihuahua | 542 |
| 6 | Yorkshire Terrier | 481 |
| 7 | American Pit Bull Mix / Pit Bull Mix | 619 |
| 7 | Chihuahua | 445 |
| 7 | Shih Tzu | 471 |
| 8 | American Pit Bull Mix / Pit Bull Mix | 526 |
| 8 | Shih Tzu | 404 |
| 8 | Yorkshire Terrier | 452 |
| 9 | American Pit Bull Mix / Pit Bull Mix | 439 |
| 9 | Shih Tzu | 431 |
| 9 | Yorkshire Terrier | 440 |
| 10 | Chihuahua | 383 |
| 10 | Shih Tzu | 393 |
| 10 | Yorkshire Terrier | 406 |
| 11 | Chihuahua | 368 |
| 11 | Shih Tzu | 381 |
| 11 | Yorkshire Terrier | 390 |
| 12 | Chihuahua | 333 |
| 12 | Shih Tzu | 364 |
| 12 | Yorkshire Terrier | 391 |
| 13 | Chihuahua | 260 |
| 13 | Shih Tzu | 310 |
| 13 | Yorkshire Terrier | 290 |
| 14 | Chihuahua | 238 |
| 14 | Shih Tzu | 270 |
| 14 | Yorkshire Terrier | 254 |
| 15 | Chihuahua | 230 |
| 15 | Shih Tzu | 188 |
| 15 | Yorkshire Terrier | 207 |
| 16 | Chihuahua | 147 |
| 16 | Shih Tzu | 160 |
| 16 | Yorkshire Terrier | 159 |
| 17 | Chihuahua | 105 |
| 17 | Shih Tzu | 141 |
| 17 | Yorkshire Terrier | 106 |
| 18 | Chihuahua | 80 |
| 18 | Shih Tzu | 90 |
| 18 | Yorkshire Terrier | 71 |
| 19 | Chihuahua | 52 |
| 19 | Shih Tzu | 76 |
| 19 | Yorkshire Terrier | 51 |
| 20 | Chihuahua | 25 |
| 20 | Jack Russell Terrier | 23 |
| 20 | Shih Tzu | 23 |
| 21 | Jack Russell Terrier | 10 |
| 21 | Maltese | 9 |
| 21 | Shih Tzu | 12 |
| 22 | Bichon Frise | 7 |
| 22 | Chihuahua | 4 |
| 22 | Maltese | 4 |
| 22 | Poodle, Toy | 5 |
In this table, we see the top 3 dog breeds for each age, dog ages 1 - 22. We can see that there are more larger dog breeds appearing as the top breed in some of the younger age categories. After 9 years, all of the top 3 dog breeds for each age consist of small dog breeds. This is consistent with literature findings that smaller dog breeds tend to live longer lives than larger dog breeds. In addition, we find that several dog breeds outshine others in being the top dog breed for many of the age categories including Chihuanhua, Yorkshire Terrier, and American Pit Bull Mix / Pit Bull Mix.
\(~\)
plot_4=
distinct_dogz %>%
group_by(borough) %>%
count(breed_name) %>%
top_n(5) %>%
ggplot(aes(x = breed_name, y = n)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_grid(~borough, scales = "free", space = "free") +
labs(
title = "Top 5 breeds of dogs by borough",
x = "Dog Breed",
y = "Total Count"
)
plot_4

Based on this plot, plotting dog breed by the total count, the most common dog breeds are American Pit Bull Mix/Pit Bull Mix, Chihuahua, Maltese, Shih Tzu, and Yorkshire Terrier. American Pit Bull Mix/Pit Bull Mix has the highest observed frequency in Brooklyn (n =2000) while the lowest in Staten Island (n = 600). Additionally, in Manhattan, Chihuahua’s, and Yorkshire Terriers have the highest observed frequency.
\(~\)
plot_5.5=
dogz %>%
mutate(license_issued_date = floor_date(as_date(license_issued_date), "month")) %>%
group_by(license_issued_date) %>%
summarize(n_obs = n()) %>%
ggplot(aes(x = license_issued_date, y = n_obs, color = n_obs)) +
geom_point() + geom_line() +
labs(title = "Number of licenses over time",
x = "License issue date",
y = "Number of licenses")
plot_5.5

Based on this line graph, license issue date by the number of licenses, the highest frequency of licences were issued between 2016 - 2018 while the fewest number of licenses were issued from 2015-2016. This could be due to the fact that the license data was captured from 2017 - 2019, but there was still data with earlier license dates captured.
\(~\)
plot_5.51=
dogz %>%
mutate(license_issued_date = month(license_issued_date),
license_issued_date = month.abb[license_issued_date],
license_issued_date = factor(license_issued_date, c("Jan","Feb", "Mar", "Apr", "May", "Jun", "Jul","Aug", "Sep", "Oct","Nov", "Dec"), ordered = TRUE)) %>%
group_by(license_issued_date) %>%
summarize(n_obs = n()) %>%
ggplot(aes(x = license_issued_date, y = n_obs, color = n_obs)) +
geom_point() + geom_line(group=1) +
labs(title = "Number of licenses issued per month",
x = "License issue date",
y = "Number of licenses")
plot_5.51

Based on this line graph, there is an increasing trend present in regards to the number of licenses issued per month. From January - March there is a steep increase which increases more gradually between March through June. There is a spike from June to July and a plateau from July to August. From August, there is a steady decrease until November where we then see an increase in December. There could be a variety of factors to explain this phenomenon, but one might be that in the warmer months individuals are more likely to want to adopt/buy a dog where it is easier to upkeep a dog as opposed to the colder months across NYC.
\(~\)
top_dogz =
distinct_dogz %>%
filter(animal_name != "Unknown",
animal_name != "Name not provided") %>%
group_by(animal_name) %>%
summarize(n_obs = n()) %>%
arrange(desc(n_obs)) %>%
top_n(10)
knitr::kable(top_dogz)
| animal_name | n_obs |
|---|---|
| Max | 720 |
| Charlie | 696 |
| Bella | 667 |
| Coco | 633 |
| Lucy | 568 |
| Lola | 539 |
| Rocky | 535 |
| Bailey | 529 |
| Buddy | 503 |
| Lucky | 497 |
In this table, we see the 10 most popular dog names among all of the data in this dataset. Max ranks in first place with 720 counts among the entire dataset. It is interesting to note that several of the dog names could be gender neutral among this top 10 list.
\(~\)
plot_6 =
distinct_dogz %>%
filter(animal_name != "Unknown",
animal_name != "Name not provided",
animal_name != "Name",
animal_birth_year != 1997) %>%
group_by(animal_birth_year, animal_name) %>%
summarize(n_obs = n()) %>%
top_n(1) %>%
ggplot(aes(x=animal_birth_year, y=n_obs, group = animal_name, color = animal_name)) + geom_point() + geom_line() +
labs(title = "Most popular dog name by birth year",
x = "Birth year",
y = "Number of dogs")
plot_6

Based on this line graph, we can see the number of dogs that had a particular name based on birth year from 2000 to 2015. Names like Coco, Max, and Bella had an increase between 2000 - 2010 while names like Charlie gained popularity starting around 2007 all the way until about 2016. Additionally, while names like Lucky and Lucy were popular, they spanned a short duration of about a year. An interesting trend is that there has been a decrease in the popularity of the name Luna from 2016.
\(~\)
pal = colorNumeric(
palette = "viridis",
domain = distinct_dogz$zip_code)
distinct_dogz %>%
group_by(breed_name) %>%
mutate(click_label = str_c("<b> year(s)", dog_age, "</b><br>", animal_birth_year, "born <br>", animal_gender, " gender")) %>%
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(~longitude, ~latitude, radius = .1, color = ~pal(zip_code), popup = ~click_label)